The purpose of this project is to explore the capabilities of social
media text-based age prediction by developing a machine learning model
to predict which age group a user belongs to by studying blog posts
written by them. I will be framing this task as a binary classification
problem, splitting users into two age groups: under 18 and
18 and over. I will be applying various methods to the Blog
Authorship Corpus dataset on Kaggle to achieve the most accurate
model.
In the past few years, especially due to the pandemic, social media usage has skyrocketed, becoming one of the most popular forms of daily communication for people across many age groups. We are able to connect with people from all walks of life and generations. This had me wondering: with this enormous quantity of textual information from such a diverse group of people, would it be possible to aggregate a segment of this social media data and examine whether there are differences in linguistic structures/speech patterns, stylistic features, and content interests among different age groups?
Obtaining an accurate model for social media authorship attribution has many applications in the digital world. With the exceptional growth of social media comes a significant demand for automatically analyzing the enormous volume of online messages for safety purposes. An accurate age-prediction model can also help recommend age-appropriate services and content to users, and it can help detect harmful online conduct; this model in particular is geared towards combating online grooming. Studies have shown that children are engaging with social media more than ever, so it is imperative to improve internet safety.
Now that we have examined the significance of this model, let’s lay
out our plan of action for building and training our model throughout
this project. First, we load and clean our data, rectifying
potential problems caused by missing data or text formatting.
I will also remove any predictor variables that are irrelevant to the
goal of this project. I will then add variables that
process the text of each post and count phrases/keywords that
may help indicate age and stage of life (e.g., profanity, career-related
terms, school-affiliated phrases). I will also use sentiment
analysis to explore whether one age group tends to write with
a more negative or positive sentiment score than the other. I will also be
adding variables to consider linguistic features such as punctuation to
word ratio, slang and abbreviation usage, and average blog length. We
will use all of these variables as predictors and the age group as our
response. We are trying to predict whether the author of a blog post is
under 18 years old or 18 years and older, so I will be adding a variable
to categorize bloggers into a factor with 2 levels. After completing all
initial data manipulation and information extraction, we can proceed to
perform a training/test split on our data, stratifying on the
age group variable. Next, I will explore if there are any
highly correlated variables and then create a recipe. Now, I can proceed
to setting folds for 10-fold cross-validation and setting up a grid of
possible values to consider when tuning each model. We will then fit the
following models: Logistic Regression, Lasso, Ridge, Elastic Net, Linear
Discriminant Analysis, Quadratic Discriminant Analysis, Gradient-Boosted
Trees, Random Forest, and K-Nearest Neighbors. The best-performing model
will then be evaluated on our testing data set. With that, we will
conclude by analyzing the efficacy and accuracy of our final model's
results.
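As a preview, here is a minimal tidymodels sketch of that workflow. Note that the data frame blog, the response age_group, and the predictors named in the recipe are placeholders for objects we will actually build later in this report, and the grid shown tunes only the K-Nearest Neighbors model as an illustration, not the final specification.
# Sketch of the modeling workflow outlined above (placeholder recipe and grid)
blog_split <- initial_split(blog, prop = 0.75, strata = age_group)  # stratified split
blog_train <- training(blog_split)
blog_folds <- vfold_cv(blog_train, v = 10, strata = age_group)      # 10-fold CV
blog_recipe <- recipe(age_group ~ slang + school + career, data = blog_train) %>%
  step_normalize(all_numeric_predictors())
knn_spec <- nearest_neighbor(neighbors = tune()) %>%                # tuning K as an example
  set_engine("kknn") %>%
  set_mode("classification")
knn_res <- workflow() %>%
  add_recipe(blog_recipe) %>%
  add_model(knn_spec) %>%
  tune_grid(resamples = blog_folds,
            grid = grid_regular(neighbors(range = c(1, 15)), levels = 8))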
We begin by loading all the packages we will be using throughout this project.
# loading required packages
library(ISLR)
library(ISLR2)
library(tidyverse)
library(dplyr)
library(tidymodels)
library(readr)
library(kknn)
library(ggplot2)
library(corrplot)
library(corrr)
library(ggthemes)
library(kableExtra)
library(discrim)
library(tidytext)
library(stringr)
library(tidyr)
library(scales)
library(SentimentAnalysis)
library(naniar)
tidymodels_prefer()
set.seed(1213)
Now, let's load the main data set we will be using: the Blog Authorship Corpus from Kaggle.
# assigning the data to a variable
blog_data <- read.csv("/Users/annalin/Desktop/blogtext1.csv", skipNul = TRUE)
# see preview of the data set
blog_data[70:100, 1:7] %>%
kable() %>%
kable_styling(full_width = F) %>%
scroll_box(width = "100%", height = "200px")
|   | id | gender | age | topic | sign | date | text |
|---|---|---|---|---|---|---|---|
| 70 | 3581210 | male | 33 | InvestmentBanking | Aquarius | 10,August,2004 | Korea’s pretty funny sometimes. Take a look at urlLink this article about changing the age of majority from 20 (already a year younger than the U.S.) to 19 (the same as British Columbia, Canada, but higher than the 18 that is prevalen federally and in some Canadian provinces). Here parents are commenting on the ability to take on rights and responsibilities at the tender age of 19. ‘My boy is almost an adult now and the law now even considers him a man, but he has to learn his responsibilities and accept his own choices in life,’ Han Ho-sun, mother of a 20-year-old, said. ‘But what does he know? He is not ready to make decisions as an adult yet. And if he is granted too much freedom, I’m worried he’ll get himself into trouble.’ Other parents of young adults agreed. They nodded their heads to the idea that the government, before lowering the adult age, should provide training and programs to teach what adult responsibilities entail - such as building up a good credit record, making political decisions and getting married. ’ What? 19 years? Kids that age aren’t really mature enough to handle anything! Let alone complete adulthood,’ said one parent who asked for anonymity. The first two seem pretty sensible, but the last one (if the name wasn’t withheld) would be prime fodder for the urlLink Tonight Show’s Headlines segment, I think. Then there’s urlLink this article about the capture of an ex-convict suspected of stabbing to death two police offices a few days ago. This part deserves special attention: As police raided the house, Lee injured himself by stabbing his abdomen and thighs several times while holding a child and a 48-year-old woman as hostages to resist arrest. He was immediately sent to Ehwa Womans University Mok-dong Hospital for emergency treatment… What? Did I just read that right? The suspect stabbed himself enough to need to go to the emergency room? What was he trying to do? Commit suicide so as to evade capture?!??! Methink perhaps the (can I say vengeful?) police on the scene may have helped him guide the knife just a bit . Maybe he did stab himself, but it sure seems weird. urlLink Here’s an op/ed on the capture and how it was handled by the police. |
| 71 | 3581210 | male | 33 | InvestmentBanking | Aquarius | 09,August,2004 | Ya, I’m off to Canada/Vancouver again soon…ah, to be back in the land of fresh air, real mountains and diapers. Luckily, I think my oldest boy is out potty-trained now. Not sure, though, haven’t been home for about 2 months…and a lot can happen in that time. Now that we’ve done a deal here my boss/CEO would have me here all of August (well, all of the year, actually) but I have to go back as it’s my younger son’s first birthday, or urlLink 첫돌/Chot-dol (they have a different name for almost everything here). For Koreans the 1st and 60th birthdays are the biggies. 1st because the child made it through the often-treacherous first year (remember, Korea used to be really, really poor…so much so that is was a 100-days celebration as well, but that is starting to become less important as more babies are living past it). 60th because the urlLink Korean Zodiac is similar to the Chinese one…with 12 animals (ya, I’m a pig). There are also five colors which elude me at the moment, although I know two of them are black and gold. The Chinese use five elements: metal, earth, wood, fire and water. This year is urlLink Year of the Wood Monkey , which I guess is better than 1981… urlLink Year of the Wood Cock/Rooster . (Hehe, had to put that in there.) Anyways, at age 60 you’ve gone through all 12 years 5 times for each of the colors/elements…a truly magical time, I guess. Thus, by age 60 men (no mention of women) should have at least one grandchild, preferably a grandSON–one more reason to be happy that we had our oldest boy in my wife’s father’s 60th year! Now about foreingers (as I/we call all non-Koreans…even in Canada for some reason). Korea ain’t called urlLink The Hermit Kingdom for nothin’. Basically, there are very, very few foreigners here. Something like 200,000 in a country of 48 Million. Most of us are (like me) in Seoul and (not like me) teachers. Others are manual labour workers from China, India and Southeast Asia (doing the urlLink 3D-dirty, dangerous, difficult-jobs that Koreans feel they no longer need to perform en mass). Foreigners (or 외국인/weigook-in or 외국사람/weigook-saram) are treated quite differently from native Koreans (and likely for good reason in some cases). We all need to register with our local 구/ku (ward/borough) office and do so (i) every 12 months or (ii) if we move to another 구 (whichever comes first). We get an Alien Registration Card that has our pic and address on it as well as our visa type (teachers are E-2, I fall under E-7 or ‘other worker, director’). We need the ARC to get a job, get a bank account or do almost everything else official here. But it’s not the same as a Korean Citizenship Card. Some promotions, for example the bonus card for urlLink Fantaseum , is only for Koreans, not me. So I can’t get 10% off movies or get bonus points there…waaaaaaaaa! Also, urlLink foreigners cannot easily buy land in Korea (although I’ve also heard that it just can’t be done…not sure who to believe on this). urlLink Foreign investment companies, however, still did so –and did in great amounts after the 1997 financial crisis. (In fact, they were the first firms in Korea to EVER use due diligence/research before buying and developing land…Korean execs, it seems, did it based more on intuition than anything else before then.) It’s also hard for foreign firms to get market share here–unless they partner with a local firm such as Otis did with LG Group concerning elevators and escalators. 
Heck, foreign car makers have just urlLink 3% of the local market ! (Not counting GM’s recent purchase of Daewoo Motors, which is a small player anyways.) One thing foreigners have had some success in doing is buying all or parts of troubled Korean companies. Korean firms put off foreign investment as long as possible, but in the end they can take on foreign investors– urlLink 대한생명/DaeHan sengmyung/Life aka Korea Life Insurance is 5% owned by Australia’s largest bank, urlLink Macquarie Bank . Korean firms, however are loathe, it seems, to disclose what most purchasers would consider to be important facts about the businesses. urlLink 교보생명/Kyobo Sengmyung/Life apparently had months of due diligence done on it by urlLink buyout firms , but in the end not enough was known to do a deal. Foreign firms have been able to urlLink buy billions of USD in Korean stocks , urlLink SK Life , urlLink large portions of Korean banks , urlLink other firms and one ( urlLink PCA ) is urlLink in the running to buy urlLink 대한투자증권/DaeHan Tooja Jeunkwon/DaeHan ITC/Investment Trust Company aka Daetoo (names here can get pretty long and complicated) though, so there is some opening up of this place. Funny, the longer I’m here, the more normal all of this seems…not sure that is a good thing. |
| 72 | 3581210 | male | 33 | InvestmentBanking | Aquarius | 09,August,2004 | Ah, finally…someone else I know is coming here from Canada for a visit. First was my cousin, Lars in May of 2003. He came at the perfect time. Weather was getting warm, but not unbearably hot and urlLink Children’s Day and Buddha’s Birthday were celebrated (day off) during his time here too. That was cool. Here are some pics from his visit. We did almost everything you could as a tourist for a short visit (but light on the bar scene as I was not as educated as I am now on it…and he was so young and impressionable at the time). urlLink Welcome to Korea…you must have SARS; Seoul City Hall traffic square; urlLink Myoungdong on the weekend; kids and wooden swords @ Korean Heritage Museum; us at the urlLink Korean Folk Village ; on urlLink the bus heading to urlLink Incheon Airport . He had a fever (from a normal cold) a few days before he got here, so that and the general lack on English language ability by almost everyone here lead him to be delayed getting out of the airport by over an hour. Next time I’ll give my visitors a little note in Korean to show the airport staff saying: ‘I’m a top gamer from _______ please make sure I don’t miss my tournament and speed me through customs quickly.’ or some such thing–Koreans love their online games here; they even have a TV station dedicated to it. Lars had this to report about Korea (and he’s flown about 1,000,000 miles thus far): it’s similar to Europe, but just LOTS more people. The second person was a guy I met through a teacher in urlLink Pusan/Busan I know through a friend of mine in urlLink Nanaimo (my old roommate, actually; funny how things seem to dovetail as we get on in years). Anyways, her sister’s bud is getting married to a gal in urlLink Taegu/Daegu and wanted some pointers on how to not screw-up in front of the in-laws-to-be. He was here for a week or so and in Seoul for a couple of days…but I think urlLink we crammed a week of drinking into one night . urlLink Yusheng should have a good time here and should be here sometime in the fall, date to be determined. The guy’s some kinda world traveler and he takes urlLink some pretty cool pic of architecture wherever he goes. I’ve noticed that many of them are devoid of human presence…something he’d be challenged to do in Seoul. There’s always people around you here no matter the time or place. If you want to see some cool shots of my other hometown, Vancouver, take a look urlLink here . Ok, time to get my urlLink pubmaps out and plan Yusheng’s course. |
| 73 | 3581210 | male | 33 | InvestmentBanking | Aquarius | 08,August,2004 | I think if I’m going to claim 여의도/Yeouido as my new hometown I should know something about it…and by extension you should too. I’ve got a little more to add to my urlLink background on it thus far . As you may know, ~do can mean two things in Korean. One is: province, such as 경기도/Kyunggi-do. The other is island, such as 제주도/Jeju-do (also spelled Cheju-do, depending on what era the map or publication was edited in…things changed in 2002, just in time to confuse the heck out of tourists coming here for the World Cup). Since Yeouido (also spelled Youido, Yoido and Yeoido) is pretty tiny and part of 영등포구/Youngdeungpo-ku (ku is like a ward or borough such as Queen’s or The Bronx) the ~do here signifies that it’s an island. I had no idea that its ~do was ‘island’ because it looks as though they’ve filled in or paved over much of the water that once separated Yeouido from the mainland. Anyways, way back when Yeouido used to be on the outskirts of Seoul and it housed the countries first real airport. (Later urlLink Gimpo/Kimpo Airport was built in the new outskirts of Seoul and now, since the city is not getting any smaller, urlLink Incheon Airport is the new International airport for Seoul; although Kimpo still handles much of the domestic traffic.) In fact, urlLink Yeouido Park , which until 1996 or so was a slab of pavement instead of the picturesque park that it is, was the runway of that airport! So, here in West Yeouido, near the park, is where the streets are named 양말산/Yang-mal-san or ‘sock mountain’. I got to thinking why it would be called sock mountain…there wasn’t a sock factory here to my knowledge and I hadn’t heard a story about removing one’s socks for any reason (although I wouldn’t dismiss it if I heard it; there are so many different traditions here). When I heard about the airport I came up with this theory: sock mountain is the hill (here they call a mountain what we in Vancouver would call a knoll, I find) where urlLink the windsock was placed. Make sense? For more exciting and ground-breaking news stay tuned to this blog. |
| 74 | 3581210 | male | 33 | InvestmentBanking | Aquarius | 08,August,2004 | Being gay in Korea is like being gay in the U.S. military…in 1950. Let’s just say that it’s not near as lovingly embraced as in the West/US. Gays marrying in Korea? C’mon! There are no Korean gays! Of course, if one is gay one can hardly be Korean. I’m not trying to piss of PCers out there, this is just the way many Koreans (especially men and older ones) think. Although the younger generation are more open to other ideas, it’s the older generation that controls politics, education, finance, and even media and culture to a large degree. The CEOs of the TV stations are all men and all old guys, as an example. Things are changing, though, as evidenced in this poster for a gay bar in Itaewon. (Now that you know its name you can either (i) know where to go or (ii) know where to avoid depending on your preference.) urlLink Yikes! Here are a few websites on gays in Korea: urlLink here and urlLink here . urlLink This blog has a tongue-in-cheek explanation of what being gay is…basically a primer in ’ urlLink gaydar ’ (how to recognize a gay in Korea or elsewhere). However, I think that many of the guys here have gay characteristics. It’s pretty weird sometimes. In a business meeting one guy from a partner firm put his hand on my leg (just above the knee) as he spoke of my experience in finance and what I do for the investment process. This would be outrightly gay, but in Korea it’s ok…it just shows (non-sexual) fondness and a measure of trust between two people. My boss does it also…much to the surprise of foreigners in our group…but, again, it’s not ‘gay’ its just the culture here. My wife’s uncle holds my hand, too…and not for 2-3 seconds; for 2-3 minutes. At first I thought it was pretty weird, but it just means that he really likes me. (Having said all this I know I’ll have to put up with a deluge of gay jokes when I see my buds back in Canada, crap.) Over here guys touch guys and girls touch girls, but it’s just not sexual. Many young gals (especially) can be seen holding hands while walking down the road. Add to that the schoolgirl uniform and you have a formula for stimulation for most Western guys. You get used to it, though. I really wonder about some of these Korean guys, though. urlLink The Face Shop has a model on all of their (make-up) stores that’s pretty gay looking. There are others as well. These guys are (I hope) more urlLink metrosexual than homosexual, although the distinction is greying. Many of the boy bands have similar dudes in them-the clothes, the hair, the dance moves. I got to really wonder sometimes. (Remember urlLink this guy? ) Perhaps because of the stigma attached to being gay in Korea most gays don’t come out of the closet–choosing instead to marry, have kids and be a ‘normal’ person or going into singing or acting or modeling and making the excuse that they’re too busy with their career to find a wife. Either way, it’s quite different from in the West where there is more acceptance of such things. On another topic: yesterday I had lunch with my CEO/boss and he mentioned that China and Korea were having a urlLink tiff about the Gorguyeo Dynasty . He said, ‘oh, how can you know about that…it is just Korean history’. 
Well, when I said that is was about 50AD-600AD and was concurrent with the Paekche and Shilla empires and before the United Shilla dynasty that preceded the closely-named Koryo Dynasty of about 1100-1300AD that was right before the venerated Chosun Dynasty of 1392-1910 he just looked at me. Not sure if he got all of what I said, but a colleague dining with us said that I do, in fact, know about Korean history…hey, I’ve been here almost 2 years-and not all of it has been in the bars. My dates of Koryo are a little off, but if you want more information of where Korea came from take a look at urlLink this site . It has a cool timeline…just right if you’re writing a report! So, if you’re speaking with a Korean don’t be surprised if they assume that since you’re a foreigner you can’t know precisely anything about Korea–and if you do know some facts or stories from its history you can really impress them well here. Oh, and just for the record: I’m not gay. |
| 75 | 3539003 | female | 14 | indUnk | Aries | 07,June,2004 | O= optimist P= pessimist My argument with myself: P: Nooooo! Stop thinking about him! O: Why? He’s my boyfriend. I’m allowed to. P: You’re obsessing! O: What?! No I’m not. I just like him a lot. P: You’re crazy if you think it’s going to work! O: Ok… Then I’m crazy! P: So you think it will work out between you two? Do you think he’s ‘the one’? O: Yes and possibly. P: Insane! I can’t believe you! You know what is going to happen! The same thing that always happens. You will get your hopes up and then He will drop you flat on your face! O: He would never do that! Not in a million years! P: What makes you think that? O: He’s different. P: How so? Every guy you’ve ever met has hurt you in some way, and he is a guy… Right? O: Of course he is a guy! P: So how do you know HE won’t break your heart too. O: I just know. P: Oh my gosh! I don’t know why you are doing this to yourself again! |
| 76 | 3539003 | female | 14 | indUnk | Aries | 05,June,2004 | Look at me. Look at my eyes. You see a tear. Now look into my eyes as I gaze aimlessly into yours. Grief, hope, anguish, love. I know you can see it all. We lock each other in a stare, trying to freeze us in this moment. A conversation passes between us, but no words are spoken. Unsure of what to do next, I run my fingers through your hair. Then we hug. This position once so odd now becomes the most comforting thing in my entire world. Then the bell rings. Yes… the bell. Once it was my savior and now I am betrayed by the bell. We stand hand in hand as the world goes on around us. I tell you I love you. You say it back. I weekly say good-bye. I walk away, only twenty steps away. I look back to see you watching me… A fading memory. I run back and kiss you. A perpetual kiss. When we part and walk away, the warmth remains on our lips. You take with you a huge part of my life. Not only do you have my entire heart, you have my first kiss too. Remember me fondly forever, because I will never forget you. I love you. Good-bye. |
| 77 | 3539003 | female | 14 | indUnk | Aries | 05,June,2004 | I walk into the only entrance of my prison. School once was my escape, but now my confinement. We are but ruthless animals in a cage. The rules here are almost amusing. No permanent marker, no writing on clothing, no cameras. So for these last few days of Jr. High we can’t take pictures!? Outrageous! Am I supposed to remember everything? I have no clear images of these last moments with my dearest friends. Just foggy memories continuously fading. We have virtually three days left of school and NOW the teachers are piling stuff on us. Finals one after another. Last minute projects are popping out of nowhere! But on to of that, we have to deal with the pain of saying good-bye. I’m not even strong enough to face one of those problems by myself, let alone all of them. Prison. |
| 78 | 3539003 | female | 14 | indUnk | Aries | 05,June,2004 | Why do I like him? He’s funny We have so much in common He is so cute We have the same sick mind He likes friends (I know huh!) Unlike other guys he can actually match He is super sweet Never is he mad He is intelligent He is open, almost… ok he’s blunt He is witty He is NOT gay (see, not all the good ones are) And yes ladys, he is real… and taken. I dunno why he’s with me. I didn’t think he’d like a girl, well… like me. I mean, I’m Not to bright(I almost get streight Fs) I’m fat I’m ugly I’m a bitch sometimes I’m clinicly insane (no… realy) That seems like a lot to deal with, don’t you think? I don’t know what the boy sees in me. But this is like a dream come true. Every girl dreams that some day the perfect guy will show up one day, even a girl like me. Well, mericals do happen. I guess there’s hope after all. Even for a girl like me. |
| 79 | 3539003 | female | 14 | indUnk | Aries | 04,June,2004 | I’ve fallen so deep, so fast. I don’t know what to do with myself. But I know I feel so good. I love to look at him, listen to the voice of the Aries. I wish I had the courage to go up to him and kiss him. I want to, but yet fade away into the shadows of fear and questioning. Dose Aries like me? How can I tell? What if he doesn’t and I’m making a fool of myself, stumbling over my feelings. A little girl with a crush just out of her reach. What if he dose like me too? If he asks me out do I say yes? Of course I do! Even though we will not see eachother as much as we’d like, who says it cant work! Right…? But what if it won’t work… I guess we’ll have to find out… |
| 80 | 3539003 | female | 14 | indUnk | Aries | 03,June,2004 | It’s hard to think that we have less than a week together. Then, it’s over. Most of my friends will fly away together leaving me without wings… aimlessly hobbling around trying to find replacement wings and friends. I almost don’t want to go to school. I don’t want to say good bye. We can say we’ll be friends forever, but the truth is, no matter how much you hate to admitt it, people grow appart. High School is the hardest most crusal part of anyone’s life, and I’m left to face it without my sheild, my crutches, my wings. I’m alone. Clanless. Friendless. As every body stands up in High School, still hand in hand walking together, helping eachother; I wander, holding my own hand, being crushed and stepped on by the clans of friends. That’s me. Flattend to the ground. The loner. No friends, no passion, no will to live. I’m starting over. Never have I stayed in one school for more than two years… never. So every other year I am forced to start over. This is like the others, but different. This is more important to me than anything else. I can’t start over again. Mabey I’ll be invisable, never get into another relationship again. I know it will just shatter like the rest. I can’t stand to be hurt one more time. I can’t pick up the shards of my broken heart alone anymore. But I have no choice. Some people say ‘The future depends on you’, but really you depend on the future. So I’m depending on the future… that tomarrow will be a better day. I will be optomistic and spend as much time as I can with my friends, for I only have a short time with them. Untill June 9th, 2004, 3:20pm: Then it’s… Good Bye. > |
| 81 | 3539003 | female | 14 | indUnk | Aries | 03,June,2004 | Stroke my face that’s resting on your lap. Tell me it’ll be ok. Tell me we will be together forever. Tell me you love me, you’ll never forget me. We’ll still see each other all the time. Nothing will change. Look into my eyes with your sympathetic eyes. Everything will be alright. No! Stop! You’re lieing! I’ll never see you again! Everything will be different! You don’t love me! You won’t returne my calls! You’ll forget my name! Stop! Go back to the week before, we are all together in Washington D.C. Now back to reality. We are drifting away. You are together, I am alone. Stroke my face that’s resting on your lap. I’m crying. Tell me it’s ok to cry. You’re crying too. Hug me. Please dont hurt me. I love you. It’ll be ok. |
| 82 | 3539003 | female | 14 | indUnk | Aries | 11,August,2004 | You may be wondering why my site is called ‘Spaghetti Wall’. It is a joke between me and some friends. We went on the Washington D.C. trip with my school. We were at a nice Italian restraunt and I got plain spaghetti noodles. One of my friends pointed out that if the noodles were cooked, they’d stick to the wall. She grabbed a handful and sure enough, the spaghetti stuck to the wall! For some reason, I couldn’t pick the noodles off the wall with my fingers. I couldn’t get a hold of them. What else could I do? I ate the spaghetti of the wall! Everybody though it was so funny! Do you remember the ‘Spaghetti Wall’? |
| 83 | 3539003 | female | 14 | indUnk | Aries | 11,August,2004 | I walk into a life long summer memory that I haven’t seen for two years. Everything is so familiar. The buildings, the people, even the attitude of the whole place are the same. I love the fact that year after year I can depend on it to be it’s old comforting place. I feel like I’m in a dream. I know every face I see, even the parents, but I fear they don’t remember me. I fear I’m just a confusingly familiar face. I’ve changed so much from last time they’ve seen me up till now. I wonder if they can see that. I am curious to find out if how I am now will effect the way I function in this place differently than before. I walk into this life long summer memory, fading into a new reality. |
| 84 | 3539003 | female | 14 | indUnk | Aries | 11,August,2004 | It’s amazing really. You’ve been through every season of my sole, and you still love me. You have always been there for me. You’ve never backed away, no matter how unbearable the weather was. It’s hard to find people loyal enough to go through with that. I’m just lucky I guess. THE SEASONS OF MY SOLE: Summer~ when everything is doing well. My life is going smoothly. No bumps in the road. Fall~ when my colors fall off. My life falls apart. Unlike a leaf, I don’t float down, I crash down hard. Winter~ when the rain starts to fall. My life is completely torn to peaces. I am bare. Winter is the harshest of all seasons. An emotional blizzard. Spring~ when my life starts anew. I gather myself, and often use you for support (I’m very weak at this point). My life begins to blossom. |
| 85 | 3539003 | female | 14 | indUnk | Aries | 11,August,2004 | I work at a day camp full of little kids. You wouldn’t believe the stories I’ve heard from these kids! I’ve heard that someone could detach his head and still live “for reals”. This one kid was born with flames painted on his fingernails! It’s amazing how many kids still fall for the “detaching finger” trick. It’s interesting to think that to them, this is all so real. While to us it all seems so fake and ignorant. I’m beginning to wonder if what seems to be reality to me is as fake to someone else as those kids’ to me. What if what they believe is real is the actual reality, and what I believe is reality is the huge made up story. |
| 86 | 3539003 | female | 14 | indUnk | Aries | 11,August,2004 | Nothing’s changed, nothing’s changed. I look about this old memory I am now re-living, and everything from the writing on the wall, to the paint job, to the people is exactly the same (with the exception of a few oddballs). I love the constancy of this beloved place is something I know I can rely on. I love the respect I get from everyone, and the old summer friends are my continuous invitation to keep coming back. Nope, nothing’s changed. |
| 87 | 3539003 | female | 14 | indUnk | Aries | 11,August,2004 | What you are about to read is a “Quick Write”. A “Quick Write” is an activity we did in my eighth grade English class. Our teacher gives us a topic to write about. We get a generally short time limit to write as much as we can about this topic keeping it as legible as we can. In this “Quick Write” the topic was My Best Friend. We were to describe our best friend, or if you didn’t have one, you could write about what your best friend would be like. That was my circumstance. My best friend would be a great person, perfect in my eyes. He’d be easy to talk to. He’d always listen and He’d be great at giving advice. I wouldn’t be afraid to tell Him my deepest, darkest secrets; I’d know He would keep them. He would not be judgmental. My best friend would never ask why I am the way I am, or why I do the things I do; He would already know. He would be extremely intelligent. I would never have to explain myself to Him; He’d simply accept everything. He would never be bitter at me for my wrong doings, but would always praise me for the things I do right. He’d be the only one with an unconditional love for me. He would never play favorites; He’d treat everyone as equals. Above all, my best friend would be a Christian, a healthy Christian who was very close to God. He’d be my best friend. It was not until when I read it aloud to my mother when I noticed I chose my best friend to be male. I couldn’t explain why. Then she pointed out that every time the nameless friend was mentioned, I had capitalized the pronoun. That’s pretty uncommon in the English language, unless of course you are referring to God or Jesus. Read the “Quick Write” again. Doesn’t it seem to describe God/Jesus? |
| 88 | 3539003 | female | 14 | indUnk | Aries | 11,August,2004 | You think you’ve known yourself forever and will know yourself forever… until you change. The truth is you don’t truly know yourself. Never have, never will. For you are wearing a mask… a thousand masks. And you never really change; you just remove one mask knowing another is beneath it. Yourself is your biggest, most well kept secret of all times. The only time you can ever reveal your secret is in you dreams. In your dreams all your masks are stripped from you. You see how you truly are. But when you awake, you forget your dreams, yourself, your hidden identity. You put on your thousand masks, so just incase one falls off you can never be seen. And you step into a world of masked people. You step into a masked world. |
| 89 | 3539003 | female | 14 | indUnk | Aries | 11,August,2004 | Have you ever felt so horrible: like a million knifes stabbing your body all at once. The pain is imbedded in you. Not only physically, but mentally too. It seeps deeper, opening every wound in you not yet completely healed. The knives penetrate through your entire body and sloe. All pain, physical and mental, becomes one thing. One ever-lasting thing. The process to heal such a great wound could take more than a lifetime. Death seems so relieving right now. But then you just freeze. You are iced. You can’t think, but thoughts are racing through your mind. You can’t speak, but you have so much to say. You can’t move, but you just can’t stay still. You don’t feel. Nothing seems real. You slip into a conscious coma. A stage of literal nothingness, with a numb pain. |
| 90 | 3539003 | female | 14 | indUnk | Aries | 11,August,2004 | “That’s not fair!” I work with a ton of little kids, so this statement is too well known. My response is usually, “Life’s not fair.” One night after a completely “unfair” day, I thought about it. I thought about how you could explain life’s not fair. I decided to take a religious stand on this subject. I came up with this: If life was supposed to be fair, we’d all go to hell. Do you think it’s fair that the most PERFECT man, who never sinned in his entire life, was tortured and killed for OUR sins and wrongdoings? But if life was fair, Jesus would not have died for us, therefore, we would not be saved. We would not have a chance to go to Heaven. So yes, life’s not fair, but it was never meant to be. |
| 91 | 3539003 | female | 14 | indUnk | Aries | 11,August,2004 | Some people say, “It’s not where you are going that matters, it’s the journey on the way.” This may be true for some; however, there are those who see it through my eyes. What if you know the journey is so rough, so unbearable, and you know you just might not be able to make it through. Do you be an optimist and lie to yourself: tell yourself it will be great and easy, only to let yourself down time and time again? I don’t. That is why I believe you should live for the destination: the “Utopia” beyond the hardships. What sense dose it make to live for the harsh path? It makes much more sense to live for the tranquility that awaits you on the other side of it all. Journey~ Harsh Sharp Painful Unbearable Rough Intolerable Agonizing Excruciating Awful Insufferable Horrendous Insupportable Violent Ruthless Fierce Vicious Emotional Anguish Suffering Destination~ Utopia Tranquil Calm Serene Peaceful Relaxing Bliss Soothing Relief Perfect Soft Constant Lovely Wonderful Paradise Which one do you want to live for: the journey or the destination? |
| 92 | 3539003 | female | 14 | indUnk | Aries | 11,August,2004 | Death can be a scary thing. It only seems like that because we know so little about it. We don’t know when our decease will take place, or where for that matter, or what is after death, although there are many opinions. Everybody is different, so I can only speak for myself, but death to me is in fact daunting. I’m not really frightened of how I die, it doesn’t matter that much to me. I’m not apprehensive about what happens to me after death, I believe my God will take me to heaven. I’m afraid of when I will die. I’m afraid that when I die, the people who I know would never have heard what I had to tell them, especially my family. I’m afraid that I’d pass before I told them I loved them so much. I couldn’t imagine their lives after I died without having a clear conscience that I loved them. I’d think they would think of my last memory of them trying to remember if it was a positive memory. Nobody wants to be remembered pessimistically. I want them all to know that I appreciate them all so much and I love them with all my heart. |
| 93 | 3539003 | female | 14 | indUnk | Aries | 11,August,2004 | I love to dance. I love the natural high I get from dancing. My movement is so limitless. Every emotion can be told through dance. I can’t even begin to explain the rush I get from dancing. I get so lightheaded. I don’t care what other people think. I don’t care what I look like, or what I’m doing. I become so incredibly care free. I am never free from stress anymore. I’m constantly working, and am always stressed out. I can never get a break! But when I’m dancing, I stop thinking. Everything lifts from my shoulders. I can move freely. There is absolutely nothing on my mind when I’m dancing, and I get light headed and dizzy. When I need to escape, I turn on the radio and wait for a good song to come on. Then I dance until I can hardly move anymore. Everything is forced from me. No more anger, anxiety, depression, stress, ECT. I am completely tranquil. Dance. |
| 94 | 3539003 | female | 14 | indUnk | Aries | 11,August,2004 | People are always saying, “Just be yourself”. When you say, “I don’t know how to act in this situation”, they always say, “Just be yourself”. But what if “yourself”, I mean the way you truly are when no one is watching, it not what they wanted or expected. You are constantly rejected for “yourself”. How would anybody attain friends or a significant other? Certainly not by being you. So according to the surrounding people, be it at home, at church, with your friends, you put on an act to satisfy their wants in a person. Like a chameleon, you change your colors, attitudes, style, personality, completely change yourself to match and blend in to your surroundings. So when people say, “Just be yourself,” think about it. They are really saying “Look, act, be… how I want you to be.” |
| 95 | 3539003 | female | 14 | indUnk | Aries | 11,August,2004 | Have you ever had the feeling you were alone? Even though the whole world is surrounding you, they aren’t really with you at all. I was sitting next to my best friend and her boyfriend who at that time was my crush. I look down at their hands, holding each other’s hands. I look down at my hands holding nobody’s but my own. As I was walking to my next class, with my arms crossed, I saw them again. This time they were hugging. Again I look at myself to see that I was in fact hugging nobody but myself. I really shouldn’t feel so alone, but for some unknown reason I do. It is painful, being alone and all. But I know there’s someone out there for me too. Actually, I KNEW there was someone out there for me, until I met my love. I finally found him. There is nobody that could compare to him. Until fate breaks us apart, which we both know will happen soon, I’m NOT alone. For once, I’m not alone. |
| 96 | 4172416 | female | 25 | indUnk | Capricorn | 08,August,2004 | urlLink im new to this, CAN you tell!- god how do i create a links column |
| 97 | 4172416 | female | 25 | indUnk | Capricorn | 08,August,2004 | Election time has rolled around again. And everyone is spitting their venom at each other in hopes of impressing the public with their ‘tact’. Um..what. Instead of pointing your greedy little fingers and accusing Mr So and So, why don’t you do one better, and tell us what YOU’RE going to do for us. Im sick of hearing what he did, how badly they did it, or how they didnt but should have. Says YOU. I guess it isnt like anyone is bound to believe anything you white collared putzs have to say anywya. You always do the opposite. I’ve been trying to educate myself on the different political platforms to figure out which I can identify with most. Personally, for some, I think lying would gather more votes. Shit, no wonder the libertarians never make it into office. To blatantly state ‘We should eliminate the entire social welfare system. This includes eliminating AFDC, food stamps, subsidized housing, and all the rest. Individuals who are unable to fully support themselves and their families through the job market must, once again, learn to rely on supportive family, church, community, or private charity to bridge the gap.’ …is a ballsy move. Oh? You thought that was it? Theres more: ‘The bulk of your welfare tax dollars goes to pay the handsome salaries of well-educated welfare workers. The poor get little from government welfare except meager handouts and a cycle of despair.’ Again…ummmm, what? So, let me get this right….government should stop paying the ‘meager handouts’ to families in need, and instead have those families go begging churches and charities for even smaller handouts..because…those ‘meager’ handouts are causing the moneywell to go dry? But hey, it’s ok to throw cash at foreign countries and help THEIR people out. Who cares that most of them deteste the United States, and somewhere in their hearts pray for our annihilation. I bet those little bastards setting our flag on fire and stomping all over it after 9/11, aren’t aware that our money covered their birth. Nothing better than paying for the future of terrorism. Maybe i’m taking this personally, as I should. Because without the ever so popular ‘welfare system’. My mother and myself might have suffered. Blah. I can see you all thinking that your precious greens have in some way helped me become the fine young woman I am today. BZZZZZZZT. Think again. The sum of monies my mother received could be accounted for with taxes paid out from my immediate family alone. She could have been on welfare for another 20 years, and technically, taxes paid by my extended families would have covered it. So i’ll just thank them, and give you the middle finger for being arrogant enough to think that you have had any impact on my families financial prosperity. And now the shoe is on the other foot, or however it goes, and my taxes are paying for your sisters-mothers-best-friends-brother-nieces illegitimate kids. And to ease your silent curiousity, my mother is now a very successful businesswoman. So suck my tampon string. Here is another Liberal issue, surely the cure to the nations healtcare AND war on drugs. Let’s just legalise it!!! It’s typical of the liberal party to promote the legalisation of drugs isn’t it. I mean, for their stance on issues alone, they have to be smoking or snorting something. 
The liberals take on anything seems to be ‘We can’t control it so lets just legalise it and save ourselves the time and money it would take to correct it.’ If you see it that way, you’re a plain idiot. Does rehabiliation not cost money? Do damages not cost money? Ok, give some kid easy access to some pot, so they can fall asleep at the wheel and crash into a building. Or kill someone. ‘Drugs should be legal. Individuals have the right to decide for themselves what to put in their bodies, so long as they take responsibility for their actions.’ Ok, well when these users are irresponsible and their actions endanger other people, .. THEN WHAT? Let me guess, have their parents bear the responsibility. Riiight. ‘Addicts out of control were a tiny minority.’ Um…So was obesity. See how thats change. And food IS legal! I can almost picture it…1972, after a demonstration on public sex, and the art of smoking pot through your girlfriends vagina, a group of teenage dropouts were arrested and spent a night in jail where they concocted this notion that if everything were legalised, the standard of their living would be improved and fuck consequences of drug induced peoples, fuck work performance, fuck cost of consequences, fuck everyone else. And the liberal party was formed. |
| 98 | 3668238 | female | 17 | Student | Gemini | 30,June,2004 | http://www.uploadimages.net/images/648898PICT0074.JPG http://www.uploadimages.net/images/789165PICT0037.JPG http://www.uploadimages.net/images/456053PICT0066.JPG my shoes travel far and wide http://www.uploadimages.net/images/322394PICT0096.JPG one day i will look nice in a pic http://www.uploadimages.net/images/149883PICT0026.JPG banana phone!!! i hav loads more but will put em on a site or somit later, ~Disturbed Angel - Beautifully Decayed~ |
| 99 | 3668238 | female | 17 | Student | Gemini | 26,June,2004 | it was fun :) Hey dad I’m writing to you, not to tell you that I still hate you, just to ask you how you feel and how we fell apart, how this fell apart. Are you happy out there in this great wide world? Do you think about your sons? Do you miss your little girl? When you lay your head down how do you sleep at night? Do you even wonder if we’re all right. We’re alright. We’re alright. It’s been a long hard road without you by my side. Why weren’t you there all the nights that we cried? You broke my mother’s heart, you broke your children for life. It’s not ok but we’re alright. I remember the days you were a hero in my eyes. But those were just a long lost memory of mine. I spent so many years learning how to survive. Now I’m writing just to let you know that I’m still alive. The days I spent so cold, so hungry, were full of hate, I was so angry, the scars run deep inside this tattooed body. There’s things I’ll take to my grave, but I’m ok, I’m ok. It’s been a long hard road without you by my side. Why weren’t you there all the nights that we cried? You broke my mother’s heart, you broke your children for life. It’s not ok but we’re alright. I remember the days you were a hero in my eyes. But those were just a long lost memory of mine. I spent so many years learning how to survive. Now I’m writing just to let you know that I’m still alive. And sometimes I forgive, yeah and this time I’ll admit that I miss you, said I miss you. It’s been a long hard road without you by my side. Why weren’t you there all the nights that we cried? You broke my mother’s heart, you broke your children for life. It’s not ok but we’re alright. I remember the days you were a hero in my eyes. But those were just a long lost memory of mine. I spent so many years learning how to survive. Now I’m writing just to let you know that I’m still alive And sometimes I forgive, yeah and this time I’ll admit that I miss you, said I miss you…hey dad -emotionless ring ring ring ring ring ring ring - banana phone! Rose said: ‘When love abandons and hate dissipates, just a memory commemorates, a time spent not to forget, anger fades and there’s no regret. done and dusted, emotion evades me, a place i trusted, left safety to save me.’ Rose says: she no longer bleeds. |
| 100 | 3668238 | female | 17 | Student | Gemini | 24,June,2004 | war is every why are we so fixated with it?? its in present life, history lessons n now english lessons. tho i do love Brooke’s work i guess i can’t complain i get to study somit i love. and the book we’re readin is quite good - ‘birdsong’ by Sebastian Faulks… ‘A few months later, on his next visit, he took Isabelle for a walk in the garden and told her that he was being posted abroad and that he was not in a position to continue their friendship. he skated round the question of marriage with pleas of poverty and unworthiness. isabelle didn’t care whether he married her or not, but when he said he would not see her again she felt the simple agony of bereavement, like a child whose only source of love has gone. for three years her loss coloured every moment of her day. when at last it became bearable it was still like a wound on which the skin would not thicken, so the least thing could reopen it. the reckless innocence of her unguided childhood was finished, but eventually a sweetness and balance in her nature returned.’ this section stands out to me. the book has yet to mention war actually lol but a very steamy (and by the woman’s description - very satisfying) scene! ‘she was choking with passion for him, but he frightened her. she wanted to comfort him but also to be taken by him, to be used by him.’ hmm i like this book lol newas gota go watch the match! sarah xxx |
The Blog Authorship Corpus dataset consists of over 600,000 posts from 19,000+ bloggers scraped from www.blogger.com and uploaded to Kaggle by user Rachael Tatman. Approximately 35 posts (7,250 words) were collected per blogger.com user, along with their self-provided gender, age, and astrological sign (we will not be using this last variable). As we can see in this preview, blog posts are often informal, diary-style entries, much like what we see in other popular social media apps. There are roughly equal numbers of male and female bloggers per age group. All formatting has already been stripped, and links within a blog post are denoted as urlLink.
Let’s further explore what information our data set provides!
# Checking our total number of obs and variables
dim(blog_data)
## [1] 681284 7
Here, we can see that the data set contains exactly 681,284 rows and 7 columns; that is, 681,284 blog entries and 7 variables to work with. Since three of those variables (id, sign, date) are irrelevant to our goal and age is the response variable, only 3 predictor variables remain. Therefore, we will definitely need to add predictor variables to our data set. To do this, we can use natural language processing on our text variable.
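For example, the irrelevant columns could eventually be dropped like so (a sketch only; we hold off for now, since id is still needed below to deduplicate posts and to label tokens, and blog_slim is just an illustrative name):
# Sketch: dropping variables that will not serve as predictors
# (deferred for now; id is still needed for deduplication and tokenization)
blog_slim <- blog_data %>%
  select(-sign, -date)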
We can also see that there is an enormous number of blog entries to work with, so we can reduce the total amount of data to allow the code to run more efficiently.
Perhaps we can do this by only examining one blog entry per user. Let’s check how many unique IDs there are in this data set.
# Counting unique user IDs
blog_data %>%
distinct(id) %>%
count()
## n
## 1 19320
We can see that there are 19,320 bloggers with around 35 posts each, so let's reduce our data set to one post each from 10,000 of the bloggers. This will still leave us with a substantial amount of data to train our model on.
# Creating a new data set with only one entry per blogger
blog <- blog_data %>%
  distinct(id, .keep_all = TRUE)
# Reduce the data set to the first 10,000 bloggers
blog <- blog[1:10000, ]
# Preview reduced data set
head(blog) %>%
kable() %>%
kable_styling(full_width = F) %>%
scroll_box(width = "100%", height = "200px")
| id | gender | age | topic | sign | date | text |
|---|---|---|---|---|---|---|
| 2059027 | male | 15 | Student | Leo | 14,May,2004 | Info has been found (+/- 100 pages, and 4.5 MB of .pdf files) Now i have to wait untill our team leader has processed it and learns html. |
| 3581210 | male | 33 | InvestmentBanking | Aquarius | 11,June,2004 | Thanks to Yahoo!‘s Toolbar I can now ’capture’ the URLs of popups…which means now I can show you some cool links to Korean Pop (K-Pop) audio and video without the need to relate instructions like: ‘go to the site, click on the pop-audio button then choose…’. So, without further ado here is the link to 24-hour K-Pop urlLink audio and the urlLink video streaming. Enjoy. |
| 3539003 | female | 14 | indUnk | Aries | 07,June,2004 | O= optimist P= pessimist My argument with myself: P: Nooooo! Stop thinking about him! O: Why? He’s my boyfriend. I’m allowed to. P: You’re obsessing! O: What?! No I’m not. I just like him a lot. P: You’re crazy if you think it’s going to work! O: Ok… Then I’m crazy! P: So you think it will work out between you two? Do you think he’s ‘the one’? O: Yes and possibly. P: Insane! I can’t believe you! You know what is going to happen! The same thing that always happens. You will get your hopes up and then He will drop you flat on your face! O: He would never do that! Not in a million years! P: What makes you think that? O: He’s different. P: How so? Every guy you’ve ever met has hurt you in some way, and he is a guy… Right? O: Of course he is a guy! P: So how do you know HE won’t break your heart too. O: I just know. P: Oh my gosh! I don’t know why you are doing this to yourself again! |
| 4172416 | female | 25 | indUnk | Capricorn | 08,August,2004 | urlLink im new to this, CAN you tell!- god how do i create a links column |
| 3668238 | female | 17 | Student | Gemini | 30,June,2004 | http://www.uploadimages.net/images/648898PICT0074.JPG http://www.uploadimages.net/images/789165PICT0037.JPG http://www.uploadimages.net/images/456053PICT0066.JPG my shoes travel far and wide http://www.uploadimages.net/images/322394PICT0096.JPG one day i will look nice in a pic http://www.uploadimages.net/images/149883PICT0026.JPG banana phone!!! i hav loads more but will put em on a site or somit later, ~Disturbed Angel - Beautifully Decayed~ |
| 4030905 | female | 17 | Student | Aries | 31,July,2004 | Met Lim this morning, then we went to SPC to get help from Raymond.MATHS..no choice la..cos there’s a maths quiz coming up and our lecturer is damn boring. Seriously, yesterday was the very first time I really can listen and pay attention during maths with minimal distraction. Went to TP in the afternoon with Janice and Mandy..studed in the library till it close..Badminton at 5pm, TP sports hall..I only managed to do little stuff at TP cos I can’t study outside..many pple plus distractions..so irritating. Headed for TM after that..Went NTUC to buy fruits and bars. Then Janice didn’t allow me to go home. So accompanied her to YAMAHA..she wanna buy a new guitar. Rich girl. |
Next, let's add a variable to categorize bloggers into our binary classification groups. We can also remove dollar signs from the text, as they may cause issues later on.
# Adding factor variable age_group with 2 levels
blog <- blog %>%
mutate(age_group = case_when(age < 18 ~ "under", # Ages under 18
                               age >= 18 ~ "over")) %>% # Ages 18 and over
mutate(age_group = as.factor(age_group)) %>%
mutate(text = gsub("\\$","",text)) # Remove dollar signs from text
To break down text into statistical data we can work
with, we will be using natural language processing methods such as word
segmentation (tokenization), sentiment analysis, terminology extraction,
and sentence breaking.
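As a sketch of the sentiment piece, the SentimentAnalysis package we loaded earlier provides analyzeSentiment(), which scores each document against several built-in dictionaries; picking the QDAP score below is just one option, not a final decision, and the call can be slow on thousands of posts.
# Sketch: scoring each post's sentiment with the SentimentAnalysis package
sentiment_scores <- analyzeSentiment(blog$text)
blog <- blog %>%
  mutate(sentiment = sentiment_scores$SentimentQDAP) # one of several dictionary scores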
First, let's compare the word frequencies of the two groups.
# Splitting groups into two individual data sets for comparison
under_group <- blog[which(blog$age < 18), ] %>%
  mutate(wordcount = sapply(strsplit(text, "[A-Za-z]\\W+"), length)) # [A-Za-z] avoids [A-z], which also matches some punctuation
over_group <- blog[which(blog$age >= 18), ] %>%
  mutate(wordcount = sapply(strsplit(text, "[A-Za-z]\\W+"), length))
# Separating chunks of texts into individual words
# Ages Under 18
text_under <- tibble(id = under_group$id,
text= under_group$text)
tokens_under <- text_under %>% unnest_tokens(word, text)
# Ages 18 and Over
text_over <- tibble(id = over_group$id,
text= over_group$text)
tokens_over <- text_over %>% unnest_tokens(word, text)
# Comparing the word frequencies of both groups
# Counting word frequencies for both groups
frequency <- bind_rows(mutate(tokens_under, age = "under"), # binding the 2 data frames together
mutate(tokens_over, age = "over")) %>%
mutate(word = str_extract(word, "[a-z']+")) %>%
count(age, word) %>%
group_by(age) %>%
mutate(proportion = n / sum(n)) %>%
select(-n) %>%
pivot_wider(names_from = age, values_from = proportion) %>% # reshaping data frame
pivot_longer(`under`,
names_to = "age", values_to = "proportion")
# Plotting the frequency comparison
ggplot(frequency, aes(x = proportion, y = `over`,
                      color = abs(`over` - proportion))) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.01),
                       low = "darkslategray4", high = "gray75") +
  facet_wrap(~age, ncol = 2) +
  theme(legend.position = "none") +
  labs(y = "18 and over", x = NULL)
Words close to the line in these plots have similar frequencies in both sets of texts. For example, both groups use words such as “about”, “actually”, and “for” at high frequencies.
Words far from the line appear much more in one set of texts than in the other. For example, the 18+ age group uses words such as “husband”, “bowie”, and “analysis” more frequently than the under-18 group, while the under-18 group more frequently uses words such as “lol”, “cya”, and “den” (a short form of the word “then”).
We can also create word clouds with the wordcloud package to visualize the 100 most frequent words within each group. We will remove stop words (i.e. insignificant words such as “the”, “and”, “for”) so they do not dominate the word cloud.
# Loading wordcloud package
library(wordcloud)
# Visualizing most frequent words for Ages under 18
tokens_under %>%
  anti_join(stop_words) %>% # remove stop words
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
# Visualizing most frequent words for Ages 18 and over
tokens_over %>%
  anti_join(stop_words) %>% # remove stop words
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
With these results, we can conclude that abbreviations/slang may be an indicator that a user is under 18. We can even see in the word cloud for ages under 18 that many of the most frequent words are not in the English dictionary. Thus, let’s add a predictor variable counting the slang and abbreviations in each text, using a collection of terms I compiled myself.
# Helper: total occurrences of a set of literal terms in each text
count_terms <- function(text, terms) {
  Reduce(`+`, lapply(terms, function(term) str_count(text, fixed(term))))
}
# Adding variable `slang`
slang_terms <- c("omg", "cuz", "coz", "wtf", "lol", "rofl", "lmao", "hmu",
                 "lmfao", "yolo", "tmi", "nvm", "jk", "imo", "tbh", "GOAT",
                 "bruh", "bae", "slay", "fleek", "BFF", "havnt", "tsk", "hehe")
blog <- blog %>%
  mutate(slang = count_terms(text, slang_terms))
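As a quick sanity check, we can run the helper on a made-up sentence (illustrative only; not from the data set):
# Hypothetical example: expect 3 (one "omg" plus two "lol")
count_terms("omg lol that was hilarious lol", c("omg", "lol"))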
The frequency graph also shows that life-stage terms such as “husband” may be indicative of age. Perhaps we could add a few variables for different stages of life, such as school, career, and family.
# Counting school-related terms
school_terms <- c("school", "class", "teacher", "tutor", "campus",
                  "university", "college", "freshman", "study", "semester")
blog <- blog %>%
  mutate(school = count_terms(text, school_terms))
# Counting career-related terms
career_terms <- c("career", "job", "manager", "boss", "employ", "occupation",
                  "profession", "internship", "coworker", "work", "wage",
                  "overtime", "recruit", "resume", "startup", "WFH",
                  "meeting", "portfolio", "tax")
blog <- blog %>%
  mutate(career = count_terms(text, career_terms))
# Counting family-related terms
family_terms <- c("husband", "wife", "child", "daughter", "son",
                  "marriage", "kid", "fiance")
blog <- blog %>%
  mutate(family = count_terms(text, family_terms))
People interested in certain topics, such as politics, may belong to the older age group, so let’s also count politics-related terms.
# Counting political terms
political_terms <- c("president", "country", "congress", "alliance", "vote",
                     "election", "corrupt", "regime", "candidate", "democracy",
                     "nominee", "partisan", "politic", "democrat", "republican")
blog <- blog %>%
  mutate(political = count_terms(text, political_terms))
People who are over 18 may also be more likely to incorporate profanity into their posts, so we should add this as a variable as well.
# Counting curse words in each text
blog <- blog %>%
  mutate(profanity = count_terms(text, c("fuck", "crap", "bitch", "damn")))
Next, we will use the SentimentAnalysis package to compute a sentiment score for each text. This package converts the text into a machine-readable format by executing a series of preprocessing operations: tokenization, stopword removal, stemming, removal of punctuation, and conversion to lower-case. It then takes a dictionary-based approach, looking words up in lists of positive and negative terms; the respective incidences of these words are combined into a single sentiment score.
We will do this because different age groups may tend to share posts with more negative or positive attitudes, so let’s take this into account in our prediction as well.
# Extract dictionary-based sentiment according to the QDAP dictionary
blog <- blog %>%
  mutate(sentiment = analyzeSentiment(text)$SentimentQDAP)
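To get a feel for this score, here is a minimal illustration on two made-up sentences (the exact values depend on the QDAP dictionary, but the signs should be positive and negative, respectively):
# Illustrative only: a clearly positive and a clearly negative sentence
demo_sentiment <- analyzeSentiment(c("I love this wonderful happy day",
                                     "This was a terrible, awful mistake"))
demo_sentiment$SentimentQDAP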
It is also possible that other characteristics, such as punctuation usage, blog length, and link usage frequency, differ with age.
So first, let’s add a variable wordcount to count the number of words per blog post. We will do this by using strsplit() to split each string at word boundaries and then sapply(..., length) to retrieve the length of each split vector.
# Counting number of words
blog <- blog %>%
  mutate(wordcount = sapply(strsplit(text, "[A-Za-z]\\W+"), length))
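As a quick illustration on a made-up sentence: the pattern splits at a letter followed by one or more non-word characters, so each resulting piece corresponds to one word.
# Illustrative only: expect 9, one per word
sapply(strsplit("Met Lim this morning, then we went to SPC.", "[A-Za-z]\\W+"),
       length)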
Next, we can use this new wordcount variable to
calculate the ratio of punctuation to words.
# Calculating punctuation to word ratio
# POSIX classes must be wrapped in brackets: [[:punct:]] matches any
# punctuation character
blog <- blog %>%
  mutate(punc_ratio = str_count(text, "[[:punct:]]") / wordcount)
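A quick check of this pattern on a toy string (illustrative only):
# Expect 4: three exclamation marks and one question mark
str_count("Hi!!! How are you?", "[[:punct:]]")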
Lastly, we will count how often bloggers included a link in their blog post. Recall that links within a blog post have been stripped of their original formatting and are denoted as urlLink. From the word clouds, it seems that people ages 18 and over tend to add more links to their posts than those under 18, so we can add a predictor variable for this.
# Adding variable to count links in each post
blog <- blog %>%
  mutate(linkcount = str_count(text, "urlLink"))
We should first see if we have any missing data since this may cause
potential issues in later parts of our data analysis and model building.
To do this, we can use the vis_miss() function.
# Create a graph calculating percentage of missing data
vis_miss(blog)
So we can see that, for the most part, there is no missing data. We do see some missing values for sentiment; this may be because none of the words in the QDAP dictionary were detected in those blog texts, so no sentiment score could be assigned to them. Since this is such a low proportion of the observations, it shouldn’t cause much of a problem. We can solve this issue by imputing the missing values via linear regression on another variable when we create our recipe.
Now that we have added all our predictor variables and analyzed missing data, we can proceed to analyzing the relationships between some predictors and our outcome variable through data visualizations.
# Plotting slang and abbreviation counts against age
blog %>%
  ggplot(aes(x = age, y = slang)) +
  ylim(0, 15) + # setting a smaller y limit to see points better
  geom_jitter(width = 0.5, size = 1) +
  geom_smooth(method = "lm", se = F, col = "darkred") +
  labs(title = "Age versus Slang Usage")
We can see that there is a clearly higher rate of slang usage among younger bloggers; that is, there is a negative correlation between age and slang. This is what we expected from our earlier frequency and word cloud plots, so using slang as a predictor variable should benefit accuracy.
Let’s calculate the average blog length for both age groups by adding up all word counts and dividing by the number of posts. Then, we can plot these side by side to see if there is a significant difference.
# Average blog length for Ages under 18
avg_under <- sum(under_group$wordcount) / length(under_group$id)
# Average blog length for Ages 18 and over
avg_over <- sum(over_group$wordcount) / length(over_group$id)
average_length <- tibble(age_group = c("under", "over"),
                         average = c(avg_under, avg_over))
# Plotting average words per blog
ggplot(average_length, aes(x = age_group, y = average, fill = age_group)) +
  geom_bar(stat = "identity") +
  labs(x = "Age Group", y = "Average Words per Blog")
So we can see that the 18-and-over group tends to have a higher word count on average than those under 18. Although the difference is not large, we could use this along with other predictor variables; for instance, posts with over 300 words have a higher chance of being written by someone 18 and over. This may be because those under 18 tend to use more abbreviations, thereby decreasing their overall word count.
# Plotting Sentiment Score against Age
ggplot(blog, aes(y = sentiment, x = age)) +
  geom_point() + geom_smooth()
From this graph, it seems that sentiment scores tend to be more extreme for ages 13-27, while for ages 30+ the text appears more neutral to positive. Overall, however, the sentiment score maintains a fairly similar level across all ages.
Let’s explore how many observations of each age group are in this data set. Ideally, we would want an approximately even number from both age groups.
blog %>%
  ggplot(aes(x = age_group)) +
  geom_bar() +
  labs(x = "Age Group", y = "# of observations")
So we can see that there are many more observations in the over group. However, there is a substantial amount of data from both groups (4,000+ each), so as long as we stratify our data by age_group, it should be adequate.
Let’s examine the relationships between all our numeric predictor variables!
# Correlation plot of the numeric predictor variables
cor_plot <- blog %>%
  select(wordcount, linkcount, sentiment, profanity, slang, political,
         school, family, career, punc_ratio) %>%
  correlate()
rplot(cor_plot)
We can see that many of the variables are positively correlated. The most correlated pairs are family with wordcount and school with wordcount. This shouldn’t be a big problem, as it logically makes sense that a higher word count correlates with a higher chance of these life-stage words appearing. If one of these words is used, it is likely to be repeated, since it is probably the topic at hand. The other variables seem to be either uncorrelated or only slightly correlated, so that shouldn’t cause a huge issue during model building.
We have now finished preparing all our variables for model building! We can start setting up our models by performing our train/test split, creating our recipe, and generating folds for k-fold cross validation.
To preface any model building, we first have to do a train/test split on our data. Let’s do a 70/30 split on our blog data, stratifying on the variable age_group. This means that 70% of our data will go into the training set while the other 30% goes into the testing set. Stratifying on age_group guarantees that the proportions of the two age groups are similar in both the training and testing sets; that is, it prevents either split from having too many of one age group and too few of the other. We perform a train/test split to prevent over-fitting, since we do not use all of the data to learn. A 70/30 split gives us a substantial amount of data to train our models on while still leaving enough to test them on.
# Set seed so we can reproduce our results
set.seed(1213)
# Performing a 70/30 Train/Test Split
blog_split <- initial_split(blog, prop = 0.7, strata = age_group)
blog_train <- training(blog_split)
blog_test <- testing(blog_split)
We should now verify that the data was split correctly.
dim(blog_train)
## [1] 6999 18
dim(blog_test)
## [1] 3001 18
So the training set holds approximately 70% of the data and the testing set approximately 30%; the data has been split exactly as we wanted.
For all our models, we will be using the same predictor variables, outcome variable, and conditions, so we should now create one recipe that we can use universally. Each model will use this recipe to train our prediction model.
We will be using 10 predictor variables: slang,
school, career, family,
sentiment, wordcount, profanity,
punc_ratio, linkcount, political.
From the exploratory data analysis, we have found that it should be fine
to use all predictor variables since none of them are that highly
correlated.
We will center and scale (i.e. normalize) all our predictors; note that step_normalize() performs the same centering and scaling, so including it alongside step_center() and step_scale() is harmless but redundant. We will also impute missing sentiment data during recipe creation.
# Recipe Creation
blog_recipe <- recipe(age_group ~ linkcount + wordcount + sentiment + slang +
                        political + punc_ratio + career + family +
                        profanity + school, data = blog_train) %>%
  step_center(all_predictors()) %>% # centering
  step_scale(all_predictors()) %>% # scaling
  step_normalize(all_numeric_predictors()) %>% # normalizing
  step_impute_linear(sentiment, impute_with = imp_vars(punc_ratio)) # imputing sentiment
# Prepping and baking the recipe to check group counts
prep(blog_recipe) %>% bake(new_data = blog_train) %>%
  group_by(age_group) %>%
  summarise(count = n())
## # A tibble: 2 × 2
## age_group count
## <fct> <int>
## 1 over 4029
## 2 under 2970
# Previewing our recipe
prep(blog_recipe) %>% bake(new_data = blog_train)
## # A tibble: 6,999 × 11
## linkc…¹ wordc…² senti…³ slang polit…⁴ punc_…⁵ career family profa…⁶ school
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.476 -0.559 -0.642 -0.208 -0.129 -0.103 -0.426 -0.297 -0.187 -0.328
## 2 -0.280 0.868 0.627 -0.208 -0.129 -0.133 0.927 0.976 -0.187 -0.328
## 3 -0.280 -0.0392 -1.07 -0.208 -0.129 -0.110 -0.426 0.212 -0.187 -0.328
## 4 -0.280 -0.191 0.485 -0.208 -0.129 -0.100 -0.426 -0.297 -0.187 -0.328
## 5 -0.280 1.56 -0.343 -0.208 1.56 0.237 9.05 0.976 0.419 1.79
## 6 -0.280 0.155 -0.372 -0.208 -0.129 -0.0209 1.38 0.212 -0.187 -0.328
## 7 -0.280 -0.191 -0.276 -0.208 -0.129 -0.122 0.0251 -0.297 -0.187 -0.328
## 8 -0.280 0.519 0.368 -0.208 -0.129 -0.113 -0.426 0.722 -0.187 1.79
## 9 -0.280 -0.405 0.417 -0.208 -0.129 -0.108 -0.426 -0.0426 -0.187 -0.328
## 10 -0.280 -0.525 -0.642 -0.208 -0.129 0.155 -0.426 -0.297 -0.187 -0.328
## # … with 6,989 more rows, 1 more variable: age_group <fct>, and abbreviated
## # variable names ¹linkcount, ²wordcount, ³sentiment, ⁴political, ⁵punc_ratio,
## # ⁶profanity
We will now create 10 folds, again stratifying on age_group, to conduct a 10-fold stratified cross validation. This means that R will assign each observation in the training data to one of the 10 folds. For each fold, an assessment set is made up of that fold, while the other k-1 folds form the training set for that fold, giving k fitted models in total.
We use k-fold cross validation because it gives a better idea of performance than a single fit and test of each model.
# Creating folds, stratified on age_group
blog_folds <- vfold_cv(blog_train, v = 10, strata = age_group)
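To see the resampling structure concretely, we can inspect the first split with rsample’s analysis() and assessment() helpers (a quick illustrative check; exact row counts depend on the stratified fold assignment):
# Each fold keeps ~9/10 of the training rows for fitting
# and holds out ~1/10 for assessment
first_fold <- blog_folds$splits[[1]]
dim(analysis(first_fold))
dim(assessment(first_fold))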
It has finally come time to build our models! As stated in the project outline, we will try nine different machine learning techniques with the same recipe and then compare results to see which model achieves the best performance. I will be using roc_auc as the metric for all models since, unlike raw accuracy, it remains informative when the classes are not perfectly balanced, as is the case here. roc_auc calculates the area under the receiver operating characteristic curve, which summarizes the performance of our classification model across all classification thresholds.
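As a small illustration of the metric itself, here is roc_auc() from yardstick on four hypothetical predictions (made-up values; by default the first factor level, over, is treated as the event):
# Toy example: probabilities that rank every "over" case above every
# "under" case yield a perfect AUC of 1
toy <- tibble(truth = factor(c("over", "over", "under", "under")),
              .pred_over = c(0.9, 0.6, 0.4, 0.2))
roc_auc(toy, truth = truth, .pred_over)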
Some of the models take a long time to run, and we have many models to run; thus, I will run my models outside of the .Rmd, store the results, and then load them into my .Rmd to avoid rerunning the models every time.
Each model follows a very similar build process. I will detail each step before performing it:
# Logistic Regression
log_model <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")
# Lasso Regression
# Tuning penalty and setting mixture to 1 for lasso regression
lasso_spec <- logistic_reg(mixture = 1, penalty = tune()) %>%
  set_mode("classification") %>%
  set_engine("glmnet")
# Ridge Regression
# Tuning penalty and setting mixture to 0 for ridge regression
ridge_spec <- logistic_reg(mixture = 0, penalty = tune()) %>%
  set_mode("classification") %>%
  set_engine("glmnet")
# Elastic Net
# Tuning both penalty and mixture
enlm_spec <- logistic_reg(mixture = tune(), penalty = tune()) %>%
  set_mode("classification") %>%
  set_engine("glmnet")
# K Nearest Neighbors
# Tuning the number of neighbors
knn_model <- nearest_neighbor(neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("classification")
# Quadratic Discriminant Analysis
qda_model <- discrim_quad() %>%
  set_mode("classification") %>%
  set_engine("MASS")
# Linear Discriminant Analysis
lda_model <- discrim_linear() %>%
  set_mode("classification") %>%
  set_engine("MASS")
# Gradient-Boosted Trees
# Tuning mtry, trees, and learn_rate
bt_spec <- boost_tree(mtry = tune(),
                      trees = tune(),
                      learn_rate = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("classification")
# Random Forest
# Tuning mtry, trees, and min_n
rf_spec <- rand_forest(mtry = tune(),
                       trees = tune(),
                       min_n = tune()) %>%
  set_engine("ranger") %>%
  set_mode("classification")
# Logistic Regression
log_wkflow <- workflow() %>%
  add_model(log_model) %>%
  add_recipe(blog_recipe)
# Lasso Regression
lasso_wkflow <- workflow() %>%
  add_model(lasso_spec) %>%
  add_recipe(blog_recipe)
# Ridge Regression
ridge_wkflow <- workflow() %>%
  add_model(ridge_spec) %>%
  add_recipe(blog_recipe)
# Elastic Net
enlm_wkflow <- workflow() %>%
  add_model(enlm_spec) %>%
  add_recipe(blog_recipe)
# K Nearest Neighbors
knn_wkflow <- workflow() %>%
  add_model(knn_model) %>%
  add_recipe(blog_recipe)
# Quadratic Discriminant Analysis
qda_wkflow <- workflow() %>%
  add_model(qda_model) %>%
  add_recipe(blog_recipe)
# Linear Discriminant Analysis
lda_wkflow <- workflow() %>%
  add_model(lda_model) %>%
  add_recipe(blog_recipe)
# Gradient-Boosted Trees
bt_wf <- workflow() %>%
  add_model(bt_spec) %>%
  add_recipe(blog_recipe)
# Random Forest
rf_wf <- workflow() %>%
  add_model(rf_spec) %>%
  add_recipe(blog_recipe)
# Lasso Regression
penalty_grid <- grid_regular(penalty(range = c(-5, 5)), levels = 40)
# Ridge Regression
# same as Lasso Regression grid
# Elastic Net
# default range of penalty
enlm_grid <- grid_regular(mixture(range = c(0, 1)), penalty(), levels = 10)
# K Nearest Neighbors
knn_grid <- grid_regular(neighbors(range = c(1, 20)), levels = 10)
# Gradient-Boosted Trees
bt_grid <- grid_regular(mtry(range = c(1, 6)),
                        trees(range = c(50, 200)),
                        learn_rate(range = c(-10, -1)),
                        levels = 5)
# Random Forest
rf_grid <- grid_regular(mtry(range = c(1, 4)),
                        trees(range = c(50, 200)),
                        min_n(range = c(10, 20)),
                        levels = 6)
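One thing worth noting: dials parameters such as penalty() and learn_rate() are defined on a log10 scale, so the ranges above are exponents rather than raw values. A quick illustrative check:
# penalty(range = c(-5, 5)) spans 1e-05 to 1e+05 on the original scale
range(penalty_grid$penalty)
# learn_rate(range = c(-10, -1)) likewise spans 1e-10 to 0.1
range(bt_grid$learn_rate)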
# Lasso Regression
tune_lasso <- tune_grid(
  object = lasso_wkflow,
  resamples = blog_folds,
  grid = penalty_grid
)
# Ridge Regression
tune_ridge <- tune_grid(
  object = ridge_wkflow,
  resamples = blog_folds,
  grid = penalty_grid
)
# Elastic Net
tune_enlm <- tune_grid(
  object = enlm_wkflow,
  resamples = blog_folds,
  grid = enlm_grid
)
# K Nearest Neighbors
tune_knn <- tune_grid(
  object = knn_wkflow,
  resamples = blog_folds,
  grid = knn_grid
)
# Gradient-Boosted Trees
tune_bt <- tune_grid(
  bt_wf,
  resamples = blog_folds,
  grid = bt_grid,
  metrics = metric_set(yardstick::roc_auc)
)
# Random Forest
tune_rf <- tune_grid(
  rf_wf,
  resamples = blog_folds,
  grid = rf_grid
)
# use save() to save
# Lasso Regression
save(tune_lasso, file = "/Users/annalin/Desktop/131FinalProject/rda/tune_lasso.rda")
# Ridge Regression
save(tune_ridge, file = "/Users/annalin/Desktop/131FinalProject/rda/tune_ridge.rda")
# Elastic Net
save(tune_enlm, file = "/Users/annalin/Desktop/131FinalProject/rda/tune_enlm.rda")
# K Nearest Neighbors
save(tune_knn, file = "/Users/annalin/Desktop/131FinalProject/rda/tune_knn.rda")
# Gradient-Boosted Trees
save(tune_bt, file = "/Users/annalin/Desktop/131FinalProject/rda/tune_bt.rda")
# Random Forest
save(tune_rf, file = "/Users/annalin/Desktop/131FinalProject/rda/tune_rf.rda")
Finally, we have chosen adequate tuning parameters and run all models! Please note that since we were constrained to MacBook-level computing power, I had to choose parameter ranges and levels that are slightly restrictive. Still, we were able to choose ones that should perform quite well. Now, let’s load the saved files back in to analyze our performance!
# load() to load results
# Lasso Regression
load("/Users/annalin/Desktop/131FinalProject/rda/tune_lasso.rda")
# Ridge Regression
load("/Users/annalin/Desktop/131FinalProject/rda/tune_ridge.rda")
# Elastic Net
load("/Users/annalin/Desktop/131FinalProject/rda/tune_enlm.rda")
# K Nearest Neighbors
load("/Users/annalin/Desktop/131FinalProject/rda/tune_knn.rda")
# Gradient-Boosted Trees
load("/Users/annalin/Desktop/131FinalProject/rda/tune_bt.rda")
# Random Forest
load("/Users/annalin/Desktop/131FinalProject/rda/tune_rf.rda")
The autoplot() function will be extremely helpful in visualizing the results of all our tuned models. It lets us visually see the effect that changing each parameter has on our metric, roc_auc.
# Using autoplot() and setting metric to roc_auc
autoplot(tune_knn, metric = 'roc_auc')
For our KNN model, we tuned the number of nearest neighbors from 1 to 20 at 10 different levels. We can see that the ROC AUC increases as the number of nearest neighbors increases, peaking at neighbors = 20 with a ROC AUC of approximately 0.685.
# Using autoplot() and setting metric to roc_auc
autoplot(tune_rf, metric = "roc_auc")
For our random forest model, we tuned 3 different parameters:
1. mtry: the number of predictors randomly sampled at each split when creating the tree models.
2. trees: the number of trees generated in the ensemble.
3. min_n: the minimum number of data points in a node needed for the node to be split further.
We can see that the number of trees does not affect the ROC AUC very significantly. It seems that as mtry decreases, the ROC AUC increases (i.e. performance improves), and as min_n increases, the ROC AUC also seems to increase. Thus, the best performing random forest model appears to be at mtry = 1, min_n = 14, and trees = 50.
# Using autoplot() and setting metric to roc_auc
autoplot(tune_enlm, metric = 'roc_auc')
For our elastic net model, we tuned the parameters mixture and penalty at 10 different levels each. From our plot, it seems that the lower the penalty, the better our model performs; as the penalty increases, the ROC AUC decreases, since the predictors’ coefficients are shrunk to values too small to predict with accurately. It also seems that our model performs best when mixture is closer to 0 (i.e. closer to ridge regression).
# Using autoplot() and setting metric to roc_auc
autoplot(tune_bt, metric = 'roc_auc')
For our boosted trees model, we tuned 5 different levels of 3 parameters:
1. trees: the number of trees within each ensemble.
2. learn_rate: the rate at which the boosting algorithm adapts at each iteration, also known as the shrinkage parameter.
3. mtry: the number of predictors randomly sampled at each split when creating the tree models.
From our plot, we can see that a learning rate of 0.1 seems to work best. For most learning rates, the number of trees did not have a significant effect on ROC AUC. On average, ROC AUC seems to decrease as the number of sampled predictors increases. Overall, the best performing boosted trees model seems to be at mtry = 2, learn_rate = 0.1, and trees = 50.
Now let’s compare the best ROC AUC value for each of the 9 models. We will do this by creating a tibble of the estimate for the ROC AUC of each machine learning technique’s optimal model.
# Logistic Regression
log_fit <- fit_resamples(log_wkflow, resamples = blog_folds) # fit the logistic regression model
# collect_metrics() lists accuracy then roc_auc, so slice(2) selects roc_auc
log_auc <- collect_metrics(log_fit) %>%
  slice(2)
# Lasso Regression
# 40 penalty values x 2 metrics = 80 rows; after sorting by mean,
# slice(80) takes the row with the largest mean (the best roc_auc here)
lasso_auc <- collect_metrics(tune_lasso) %>%
  arrange(mean) %>%
  slice(80)
# Ridge Regression
ridge_auc <- collect_metrics(tune_ridge) %>%
  arrange(mean) %>%
  slice(80)
# Elastic Net (10 x 10 grid x 2 metrics = 200 rows)
en_auc <- collect_metrics(tune_enlm) %>%
  arrange(mean) %>%
  slice(200)
# K Nearest Neighbors (10 neighbor values x 2 metrics = 20 rows)
knn_auc <- collect_metrics(tune_knn) %>%
  arrange(mean) %>%
  slice(20)
# Boosted Trees (5 x 5 x 5 = 125 rows; roc_auc was the only metric collected)
bt_auc <- collect_metrics(tune_bt) %>%
  arrange(mean) %>%
  slice(125)
# Random Forest (4 x 6 x 6 = 144 combinations x 2 metrics = 288 rows)
rf_auc <- collect_metrics(tune_rf) %>%
  arrange(mean) %>%
  slice(288)
# Linear Discriminant Analysis
# Fit LDA model
lda_fit <- fit_resamples(lda_wkflow, resamples = blog_folds)
# slice(2) again selects the roc_auc row
lda_auc <- collect_metrics(lda_fit) %>%
  slice(2)
# Quadratic Discriminant Analysis
# Fit QDA model
qda_fit <- fit_resamples(qda_wkflow, resamples = blog_folds)
qda_auc <- collect_metrics(qda_fit) %>%
  slice(2)
# Creating a tibble of each technique's best ROC AUC
compare <- tibble(Model = c("Logistic Regression", "Lasso Regression",
                            "Ridge Regression", "Elastic Net",
                            "K Nearest Neighbors", "Boosted Trees",
                            "Random Forest", "LDA", "QDA"),
                  ROC_AUC = c(log_auc$mean, lasso_auc$mean, ridge_auc$mean,
                              en_auc$mean, knn_auc$mean, bt_auc$mean,
                              rf_auc$mean, lda_auc$mean, qda_auc$mean))
# Arranging our models from greatest ROC_AUC score to least
compare <- compare %>%
  arrange(desc(ROC_AUC))
compare
## # A tibble: 9 × 2
## Model ROC_AUC
## <chr> <dbl>
## 1 Random Forest 0.718
## 2 Boosted Trees 0.714
## 3 Logistic Regression 0.696
## 4 Elastic Net 0.696
## 5 Lasso Regression 0.695
## 6 Ridge Regression 0.695
## 7 LDA 0.682
## 8 K Nearest Neighbors 0.670
## 9 QDA 0.635
Let’s visualize these results.
# Creating a barplot of ROC_AUC values
ggplot(compare, aes(x = Model, y = ROC_AUC)) +
  geom_bar(stat = "identity", aes(fill = Model)) +
  scale_fill_manual(values = c("green1", "blue1", "green2", "blue2", "green3",
                               "blue3", "green4", "blue4", "green1")) +
  theme(legend.position = "none") +
  labs(title = "Comparing ROC_AUC by Model")
From the performance of our models on the cross-validation data, it seems that Random Forest is our best performing model, followed closely by the Boosted Trees model, so we will analyze the results of both. We should also note that the QDA model performs much worse than all the other models, which suggests that the data does not follow a Gaussian distribution.
So our random forest model ends up being the best performing model, but which tuned parameters were chosen to yield these results?
# Using show_best() to choose the random forest model with the largest ROC_AUC
show_best(tune_rf, n = 1)
## # A tibble: 1 × 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 1 50 14 roc_auc binary 0.718 10 0.00626 Preprocessor1_Model0…
Here, we can see that our best performing model has parameters tuned at mtry = 1, trees = 50, and min_n = 14. That is, the forest that samples 1 predictor at each split, grows 50 trees, and requires a minimum node size of 14 had the best performance, with a ROC AUC value of 0.7177!
Our second best performing model was the Boosted Trees model; let’s see which tuned parameters were chosen!
# Using show_best() to choose the boosted trees model with the largest ROC_AUC
show_best(tune_bt, n = 1)
## # A tibble: 1 × 9
## mtry trees learn_rate .metric .estimator mean n std_err .config
## <int> <int> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 2 50 0.1 roc_auc binary 0.714 10 0.00678 Preprocessor1_M…
We can see that the best boosted trees model samples 2 predictors at each split, uses 50 trees, and has a 0.1 shrinkage parameter, giving a ROC AUC score of 0.7144!
Now, we will take these two models and fit them one last time to the entire training data set (rather than just the cross-validation folds).
# Random Forest
# Assign best RF model to best_rf_model
best_rf_model <- select_best(tune_rf, metric="roc_auc")
# Fit to training data
rf_final_wkflow <- finalize_workflow(rf_wf, best_rf_model)
rf_final_fit <- fit(rf_final_wkflow, data = blog_train)
# Boosted Trees
# Assign best BT model to best_bt_model
best_bt_model <- select_best(tune_bt, metric="roc_auc")
# Fit to training data
bt_final_wkflow <- finalize_workflow(bt_wf, best_bt_model)
bt_final_fit <- fit(bt_final_wkflow, data = blog_train)
Now, we can test our final two models on the testing data set, which is data the models have not seen yet! This way, we can get proper insight into how the models actually perform.
# Random Forest
# Comparing predicted vs actual values
final_rf_test <- augment(rf_final_fit, blog_test) %>%
  dplyr::select(age_group, starts_with(".pred")) %>%
  roc_auc(truth = age_group, .pred_over) %>%
  dplyr::select(.estimate)
final_rf_test # Call results
## # A tibble: 1 × 1
## .estimate
## <dbl>
## 1 0.709
# Boosted Trees
# Comparing predicted vs actual values
final_bt_model_test <- augment(bt_final_fit, blog_test) %>%
  dplyr::select(age_group, starts_with(".pred")) %>%
  roc_auc(truth = age_group, .pred_over) %>%
  dplyr::select(.estimate)
final_bt_model_test # Call Results
## # A tibble: 1 × 1
## .estimate
## <dbl>
## 1 0.709
We can see that both the random forest and boosted trees models performed on the testing set similarly to how they did on the cross-validation folds, each with a ROC AUC score of about 0.709. Our models have done fairly well considering the difficult task we attempted to tackle: age prediction from social media posts. They could certainly be improved with more elaborate natural language processing techniques, such as semantic role labeling and grammar induction. Nevertheless, these results are a decent start on the possibilities of using machine learning to predict age from text!
Let’s plot our ROC curves to visualize our AUC scores. Ideally, we want a curve arched as far toward the top-left as possible, above the diagonal line; a perfect AUC score would have a ROC curve resembling a right angle with its corner at the top-left.
Starting off with our random forest ROC curve:
# Random Forest
rf_roc <- augment(rf_final_fit, blog_test) %>%
  dplyr::select(age_group, starts_with(".pred"))
# Compute ROC curve
roc_curve(rf_roc, truth = age_group, .pred_over) %>%
  autoplot() # plot
Then our boosted trees model ROC curve:
bt_roc <- augment(bt_final_fit, blog_test) %>%
  dplyr::select(age_group, starts_with(".pred"))
# Compute ROC curve
roc_curve(bt_roc, truth = age_group, .pred_over) %>%
  autoplot()
The ROC Curve plots were similar for both models, which makes sense since their AUC scores were similar as well.
Throughout this project, we have analyzed and fit several models with the goal of finding the best model to classify the age group of an author from one of their blog posts. Ultimately, we concluded that the random forest model was our best performing model, with the boosted trees model following closely behind. This makes sense, as random forest models have many strengths that support a higher level of accuracy in predicting outcomes. One useful property is that they tend to cope reasonably well with data sets where the outcome classes are not equal in size; since we had substantially more data in our over group, this helped. Furthermore, the random forest model is nonparametric and more flexible, since it makes no distributional assumptions about the outcome. However, the best performing model samples only one predictor at each split, so perhaps we could improve it by exploring the data more deeply for additional patterns and possible predictors.
We should also note that our worst performing model is the Quadratic Discriminant Analysis model. This is unsurprising, since QDA assumes each response class follows a normal distribution, and because each class gets its own covariance matrix, the model may be too flexible. Also, since this is a binary classification problem, more complex techniques like LDA and QDA may be unnecessary; logistic regression is often adequate, and here the simple logistic regression model did perform better than both LDA and QDA. If we had separated the authors into more than 2 age groups, LDA and QDA would have been more attractive.
To further improve performance, we could apply other, more complex methods such as neural networks, support vector machines, or Naive Bayes. We could also add more predictor variables based on other linguistic differences between the groups; for example, predictors for the topics of each post. Additionally, if I had used an established dictionary of slang/abbreviations rather than compiling a list myself, I would have better coverage of the slang that is out there, rather than just the terms I know of or could find.
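As one concrete direction, here is a minimal sketch of how a Naive Bayes model could slot into the existing setup (assuming the discrim and klaR packages; this model was not fit in this project, so no performance is claimed):
# Sketch only: a Naive Bayes spec reusing blog_recipe and blog_folds
library(discrim)
nb_spec <- naive_Bayes() %>%
  set_mode("classification") %>%
  set_engine("klaR")
nb_wkflow <- workflow() %>%
  add_model(nb_spec) %>%
  add_recipe(blog_recipe)
nb_fit <- fit_resamples(nb_wkflow, resamples = blog_folds)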
Overall, I had a great experience assembling this age prediction project and advancing my knowledge of machine learning! I’ve always been interested in NLP, so I thoroughly enjoyed the opportunity to explore that interest further. I look forward to digging into this topic more and continuing to learn new machine learning techniques.
All information can be found in the link above.